MTA Ridership & Turnstile‑Hopping Data Visualization¶
MTA Open Data Challenge – Sep 2024 → Nov 2024
Author: Ibrahim Faruquee
Tools: Python 3.11, pandas, NumPy, matplotlib, seaborn, folium, contextily, scikit‑learn
Deliverables:
- Animated geographic scatterplots of hourly subway ridership
- Spatial join of turnstile‑hopping incidents with NYPD arrest records
- Interactive Folium map for stakeholder exploration
Project Overview¶
The goal of this project is to explore two complementary open‑data sets released by the MTA and NYPD:
| Dataset | Source | Key Fields |
|---|---|---|
| MTA Subway Hourly Ridership | https://data.ny.gov/Transportation/MTA-Subway-Hourly-Ridership-Beginning-February-202%E2%80%A6 | station_complex, latitude, longitude, transit_timestamp, ridership |
| NYPD Arrests (Turnstile‑Jumping) | https://data.cityofnewyork.us/Public-Safety/NYPD-Arrests-Data-Historic-/8h9b%E2%80%91rp9u/about_data | ARREST_KEY, ARREST_DATE, PD_DESC, Latitude, Longitude |
Objectives
- Visualize spatial‑temporal ridership patterns across the NYC subway network.
- Identify “turnstile‑hopping” incidents and associate each incident with its nearest subway station.
- Produce an enriched arrests table that includes the nearest station name.
- Deliver interactive visual assets (animated GIFs, Folium map) for stakeholder review.
# Uncomment for library installations
# %pip install pandas numpy matplotlib seaborn scipy folium ipython tqdm contextily pillow jupyterlab
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import matplotlib.dates as mdates
import contextily as ctx
from matplotlib.colors import LinearSegmentedColormap
from IPython.display import Image, display
from scipy.spatial.distance import cdist
import folium
from folium.plugins import MarkerCluster
Load Ridership Data¶
# Load data from CSV file into a pandas DataFrame
df = pd.read_csv('clean_subway.csv')
df.sample(10)
| station_complex_id | transit_timestamp | transit_mode | station_complex | borough | latitude | longitude | Georeference | ridership | transfers | |
|---|---|---|---|---|---|---|---|---|---|---|
| 789851 | 277 | 2024-07-31 01:00:00 | subway | 7 Av (E,B,D) | Manhattan | 40.762863 | -73.981636 | POINT (-73.981636 40.762863) | 37 | 0 |
| 606508 | 214 | 2024-06-28 18:00:00 | subway | 182-183 Sts (B,D) | Bronx | 40.856094 | -73.900740 | POINT (-73.90074 40.856094) | 123 | 2 |
| 2032422 | 291 | 2024-03-27 21:00:00 | subway | Clinton-Washington Avs (G) | Brooklyn | 40.688087 | -73.966840 | POINT (-73.96684 40.688087) | 126 | 1 |
| 1072976 | 373 | 2024-04-03 01:00:00 | subway | E 149 St (6) | Bronx | 40.812120 | -73.904100 | POINT (-73.9041 40.81212) | 12 | 0 |
| 2221750 | 355 | 2024-08-13 12:00:00 | subway | Winthrop St (2,5) | Brooklyn | 40.656650 | -73.950200 | POINT (-73.9502 40.65665) | 229 | 7 |
| 2777453 | 66 | 2024-01-24 21:00:00 | subway | 18 Av (D) | Brooklyn | 40.607952 | -74.001740 | POINT (-74.00174 40.607952) | 38 | 0 |
| 1466312 | 620 | 2024-06-04 02:00:00 | subway | Court St (R)/Borough Hall (2,3,4,5) | Brooklyn | 40.693220 | -73.990000 | POINT (-73.99 40.69322) | 8 | 0 |
| 2377818 | 407 | 2024-01-01 23:00:00 | subway | Astor Pl (6) | Manhattan | 40.730053 | -73.991070 | POINT (-73.99107 40.730053) | 135 | 0 |
| 389416 | 143 | 2024-06-01 07:00:00 | subway | Inwood-207 St (A) | Manhattan | 40.868073 | -73.919900 | POINT (-73.9199 40.868073) | 192 | 19 |
| 2068311 | 301 | 2024-02-09 23:00:00 | subway | 181 St (1) | Manhattan | 40.849506 | -73.933590 | POINT (-73.93359 40.849506) | 76 | 2 |
print(df['station_complex_id'].value_counts())
df[df['station_complex_id'] == 76]
station_complex_id
628 17430
620 14267
606 13756
611 13654
617 12381
...
83 2660
84 2622
86 2573
85 2551
8 1604
Name: count, Length: 449, dtype: int64
| station_complex_id | transit_timestamp | transit_mode | station_complex | borough | latitude | longitude | Georeference | ridership | transfers | |
|---|---|---|---|---|---|---|---|---|---|---|
| 211204 | 76 | 2024-01-02 07:00:00 | subway | Bay Pkwy (N) | Brooklyn | 40.611816 | -73.98185 | POINT (-73.98185 40.611816) | 242 | 12 |
| 211205 | 76 | 2024-01-04 18:00:00 | subway | Bay Pkwy (N) | Brooklyn | 40.611816 | -73.98185 | POINT (-73.98185 40.611816) | 9 | 0 |
| 211206 | 76 | 2024-01-07 00:00:00 | subway | Bay Pkwy (N) | Brooklyn | 40.611816 | -73.98185 | POINT (-73.98185 40.611816) | 4 | 0 |
| 211207 | 76 | 2024-01-07 01:00:00 | subway | Bay Pkwy (N) | Brooklyn | 40.611816 | -73.98185 | POINT (-73.98185 40.611816) | 2 | 0 |
| 211208 | 76 | 2024-01-07 02:00:00 | subway | Bay Pkwy (N) | Brooklyn | 40.611816 | -73.98185 | POINT (-73.98185 40.611816) | 3 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2807410 | 76 | 2024-09-05 19:00:00 | subway | Bay Pkwy (N) | Brooklyn | 40.611816 | -73.98185 | POINT (-73.98185 40.611816) | 111 | 3 |
| 2807411 | 76 | 2024-09-05 20:00:00 | subway | Bay Pkwy (N) | Brooklyn | 40.611816 | -73.98185 | POINT (-73.98185 40.611816) | 70 | 2 |
| 2807412 | 76 | 2024-09-05 21:00:00 | subway | Bay Pkwy (N) | Brooklyn | 40.611816 | -73.98185 | POINT (-73.98185 40.611816) | 34 | 1 |
| 2807413 | 76 | 2024-09-05 22:00:00 | subway | Bay Pkwy (N) | Brooklyn | 40.611816 | -73.98185 | POINT (-73.98185 40.611816) | 38 | 1 |
| 2807414 | 76 | 2024-09-05 23:00:00 | subway | Bay Pkwy (N) | Brooklyn | 40.611816 | -73.98185 | POINT (-73.98185 40.611816) | 15 | 1 |
6766 rows × 10 columns
len(df)
2890213
df.drop_duplicates(inplace=True)
len(df)
2886421
# Convert transit_timestamp to datetime
df['transit_timestamp'] = pd.to_datetime(df['transit_timestamp'])
# Create custom colormap
colors = ['darkgreen', 'green', 'yellowgreen', 'yellow', 'orange', 'red', 'darkred']
n_bins = 100
cmap = LinearSegmentedColormap.from_list('custom', colors, N=n_bins)
Hourly Ridership in 2024 Scatterplot with Map Video link¶
Created an animated scatterplot that displays the ridership of every station hourly over the coarse of 2024. It uses colorful dots to represent each station's ridership, with red indicating high usage, yellow for average, and green for low. As the animation progresses, viewers can see how subway usage changes hour by hour, with the date and the time in the top left corner. Using contextily library to map the scatterplot points on a map. Map data is from OpenStreetMaps.
# Create an hourly timestamp column as integer
df['hour'] = df['transit_timestamp'].dt.floor('H')
# Group by station and hour, calculating mean hourly ridership
hourly_data = df.groupby(['station_complex', 'hour', 'latitude', 'longitude'])['ridership'].mean().reset_index()
# Calculate percentiles for normalization
low, high = np.percentile(hourly_data['ridership'], [5, 95])
# Create figure and axis
fig, ax = plt.subplots(figsize=(12, 8))
# Set up the plot limits based on your data
ax.set_xlim(df['longitude'].min(), df['longitude'].max())
ax.set_ylim(df['latitude'].min(), df['latitude'].max())
# Add a background map using contextily
ctx.add_basemap(ax, crs="EPSG:4326", source=ctx.providers.CartoDB.Positron)
# Initialize scatter plot with RdYlGn_r colormap
scatter = ax.scatter([], [], c=[], cmap='RdYlGn_r', alpha=0.6, s=50, vmin=low, vmax=high)
# Set up the plot
ax.set_title('NYC Subway Hourly Ridership in 2024', fontsize=16)
ax.set_xlabel('Longitude', fontsize=12)
ax.set_ylabel('Latitude', fontsize=12)
# Add colorbar
cbar = plt.colorbar(scatter)
cbar.set_label('Average Hourly Ridership', fontsize=12)
# Add timestamp text
timestamp_text = ax.text(0.02, 0.98, '', transform=ax.transAxes, fontsize=12, verticalalignment='top',
bbox=dict(facecolor='white', alpha=0.7, edgecolor='none'))
# Get unique hours for animation
unique_hours = sorted(hourly_data['hour'].unique())
# Animation update function
def update(frame):
current_hour = unique_hours[frame]
current_data = hourly_data[hourly_data['hour'] == current_hour]
scatter.set_offsets(current_data[['longitude', 'latitude']])
scatter.set_array(current_data['ridership'])
timestamp_text.set_text(current_hour.strftime('%Y-%m-%d %H:%M'))
return scatter, timestamp_text
# Create animation
ani = animation.FuncAnimation(fig, update, frames=len(unique_hours),
interval=200, blit=True)
plt.tight_layout()
plt.show()
# Uncomment the following line to save the animation as gif (requires pillow)
ani.save('nyc_subway_hourly_ridership_in_2024_with_map.gif', writer='pillow', fps=20)
print("Animation saved successfully.")
Animation saved successfully.
# Unable to display gif. File too large.
# display(Image(filename='nyc_subway_hourly_ridership_in_2024_with_map.gif'))
Average Hourly Rideship with Map Video Link¶
Created an animated scatterplot of average hourly ridership. It averages together the 2024 data to show how ridership patterns change throughout an average day, using color-coded dots to represent each station's average ridership for each hour. The animation cycles through 24 hours, updating the map to reflect ridership levels at each station. A text label in the corner updates to show the current hour. Using contextily library to map the scatterplot points on a map. Map data is from OpenStreetMaps.
# Create an hourly timestamp column as datetime
df['hour'] = df['transit_timestamp'].dt.hour
# Group by station and hour, calculating mean hourly ridership
hourly_data = df.groupby(['station_complex', 'hour', 'latitude', 'longitude'])['ridership'].mean().reset_index()
# Calculate percentiles for normalization
low, high = np.percentile(hourly_data['ridership'], [5, 95])
# Create figure and axis
fig, ax = plt.subplots(figsize=(12, 8))
# Set up the plot limits based on your data
ax.set_xlim(df['longitude'].min(), df['longitude'].max())
ax.set_ylim(df['latitude'].min(), df['latitude'].max())
ax.set_title('NYC Subway Average Hourly Ridership', fontsize=16)
ax.set_xlabel('Longitude', fontsize=12)
ax.set_ylabel('Latitude', fontsize=12)
# Add a background map using contextily with OpenStreetMap
ctx.add_basemap(ax, crs="EPSG:4326", source=ctx.providers.OpenStreetMap.Mapnik)
# Initialize scatter plot
scatter = ax.scatter([], [], c=[], cmap=cmap, alpha=0.6, s=50, vmin=low, vmax=high)
# Add colorbar
cbar = plt.colorbar(scatter)
cbar.set_label('Average Hourly Ridership', fontsize=12)
# Add hour text
hour_text = ax.text(0.02, 0.98, '', transform=ax.transAxes, fontsize=12, verticalalignment='top')
# Animation update function
def update(frame):
current_hour = frame
current_data = hourly_data[hourly_data['hour'] == current_hour]
scatter.set_offsets(current_data[['longitude', 'latitude']])
scatter.set_array(current_data['ridership'])
hour_text.set_text(f'Hour: {current_hour:02d}:00')
return scatter, hour_text
# Create animation
ani = animation.FuncAnimation(fig, update, frames=24, interval=500, blit=True)
plt.tight_layout()
plt.show()
# Save animation
ani.save('nyc_subway_average_hourly_ridership_with_map.gif', writer='pillow', fps=2)
print("Animation saved successfully.")
Animation saved successfully.
display(Image(filename='nyc_subway_average_hourly_ridership_with_map.gif'))
<IPython.core.display.Image object>
Average Hourly Ridership with Graph of Top 10 Stations with Most Ridership Video Link¶
# Create an hourly timestamp column as datetime
df['hour'] = df['transit_timestamp'].dt.hour
# Group by station and hour, calculating mean hourly ridership
hourly_data = df.groupby(['station_complex', 'hour', 'latitude', 'longitude'])['ridership'].mean().reset_index()
# Calculate percentiles for normalization
low, high = np.percentile(hourly_data['ridership'], [5, 95])
# Determine fixed limits for the bar chart based on overall data
overall_min_ridership = hourly_data['ridership'].min()
overall_max_ridership = hourly_data['ridership'].max()
# Create figure and axes for scatter plot and bar chart
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 8))
# Set up the plot limits based on your data for scatter plot
ax1.set_xlim(df['longitude'].min(), df['longitude'].max())
ax1.set_ylim(df['latitude'].min(), df['latitude'].max())
ax1.set_title('NYC Subway Average Hourly Ridership', fontsize=16)
ax1.set_xlabel('Longitude', fontsize=12)
ax1.set_ylabel('Latitude', fontsize=12)
# Add a background map using contextily with OpenStreetMap
ctx.add_basemap(ax1, crs="EPSG:4326", source=ctx.providers.OpenStreetMap.Mapnik)
# Initialize scatter plot
scatter = ax1.scatter([], [], c=[], cmap=cmap, alpha=0.6, s=50, vmin=low, vmax=high)
# Add colorbar to scatter plot
cbar = plt.colorbar(scatter, ax=ax1)
cbar.set_label('Average Hourly Ridership', fontsize=12)
# Add hour text to scatter plot
hour_text = ax1.text(0.02, 0.98, '', transform=ax1.transAxes, fontsize=12, verticalalignment='top')
# Animation update function
def update(frame):
current_hour = frame
current_data = hourly_data[hourly_data['hour'] == current_hour]
# Update scatter plot data
scatter.set_offsets(current_data[['longitude', 'latitude']])
scatter.set_array(current_data['ridership'])
# Update hour text
hour_text.set_text(f'Hour: {current_hour:02d}:00')
# Get top 10 stations by ridership for the current hour
top_stations = current_data.nlargest(10, 'ridership')
# Clear previous bar chart and plot new data
ax2.clear()
ax2.barh(top_stations['station_complex'], top_stations['ridership'], color='skyblue')
ax2.set_title(f'Top 10 Stations - Hour: {current_hour:02d}:00', fontsize=16)
ax2.set_xlabel('Average Hourly Ridership', fontsize=12)
# Set consistent x-axis limits for the bar chart based on overall data
ax2.set_xlim(overall_min_ridership * 0.9, overall_max_ridership * 1.1) # Adjust margins as needed
ax2.invert_yaxis() # Invert y-axis to have the highest ridership on top
return scatter, hour_text
# Create animation
ani = animation.FuncAnimation(fig, update, frames=24, interval=500, blit=True)
plt.tight_layout()
plt.show()
# Save animation
ani.save('nyc_subway_average_hourly_ridership_with_map_and_graph.gif', writer='pillow', fps=2)
print("Animation saved successfully.")
Animation saved successfully.
display(Image(filename='nyc_subway_average_hourly_ridership_with_map_and_graph.gif'))
<IPython.core.display.Image object>
Build Folium Map¶
arrests_df = pd.read_csv('NYPD_Arrests_small.csv')
subway_df = pd.read_csv('clean_subway.csv')
# Create a new dataframe with only the columns that have THEFT OF SERVICES in the PD_DESC column
turnstile_jumps = arrests_df[arrests_df['PD_DESC'].str.contains('THEFT OF SERVICES', case=False, na=False)]
len(turnstile_jumps)
turnstile_jumps.sample(5)
| ARREST_KEY | ARREST_DATE | PD_CD | PD_DESC | KY_CD | OFNS_DESC | LAW_CODE | LAW_CAT_CD | ARREST_BORO | ARREST_PRECINCT | JURISDICTION_CODE | AGE_GROUP | PERP_SEX | PERP_RACE | X_COORD_CD | Y_COORD_CD | Latitude | Longitude | Lon_Lat | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2387 | 278882884 | 12/12/2023 | 478 | THEFT OF SERVICES, UNCLASSIFIE | 343.0 | OTHER OFFENSES RELATED TO THEF | PL 1651503 | M | M | 17 | 0 | 45-64 | M | BLACK | 991591 | 210258 | 40.743783 | -73.973508 | POINT (-73.97350763680087 40.74378315276204) |
| 1705 | 278501629 | 12/05/2023 | 478 | THEFT OF SERVICES, UNCLASSIFIE | 343.0 | OTHER OFFENSES RELATED TO THEF | PL 1651503 | M | B | 41 | 1 | 25-44 | M | BLACK | 1014509 | 238113 | 40.820189 | -73.890675 | POINT (-73.8906752472269 40.820188951147415) |
| 3354 | 277216547 | 11/08/2023 | 478 | THEFT OF SERVICES, UNCLASSIFIE | 343.0 | OTHER OFFENSES RELATED TO THEF | PL 1651503 | M | B | 40 | 1 | 25-44 | M | BLACK | 1007046 | 236603 | 40.816067 | -73.917644 | POINT (-73.91764392674679 40.81606683945181) |
| 263 | 279512771 | 12/26/2023 | 478 | THEFT OF SERVICES, UNCLASSIFIE | 343.0 | OTHER OFFENSES RELATED TO THEF | PL 1651503 | M | B | 41 | 1 | 25-44 | M | BLACK | 1010774 | 235171 | 40.812126 | -73.904181 | POINT (-73.90418129247784 40.812125995016096) |
| 2062 | 279447628 | 12/23/2023 | 478 | THEFT OF SERVICES, UNCLASSIFIE | 343.0 | OTHER OFFENSES RELATED TO THEF | PL 1651503 | M | M | 19 | 0 | 25-44 | M | BLACK HISPANIC | 995369 | 221138 | 40.773642 | -73.959856 | POINT (-73.9598555143181 40.77364202322589) |
CSV File for 2nd Challenge¶
Found the nearest subway station for each turnstile jumping incident based on their geographic coordinates. It calculates the distance between the incident locations and all subway stations, then counts how many incidents occurred at each station. The results, including the station names and their coordinates, are saved into a result_df dataframe. Additionally, it creates an enriched dataset of the arrest data, which includes details about each incident along with the nearest subway station, and exports this information to a CSV file. The result_df dataframe is used later to create a map with the number of turnstile jumps at each station. The exported CSV file completes the second challenge.
def find_nearest_station(lat, lon, stations):
arrest_coords = np.array([[lat, lon]])
station_coords = stations[['latitude', 'longitude']].values
distances = cdist(arrest_coords, station_coords)
nearest_index = distances.argmin()
return stations.iloc[nearest_index]['station_complex']
# Apply function to get nearest station for each arrest
turnstile_jumps['nearest_station'] = turnstile_jumps.apply(
lambda row: find_nearest_station(row['Latitude'], row['Longitude'], subway_df),
axis=1
)
# Count turnstile jumpings per station
jumps_per_station = turnstile_jumps['nearest_station'].value_counts().reset_index()
jumps_per_station.columns = ['station_complex', 'turnstile_jumpings']
# Merge with subway data to get coordinates
result_df = pd.merge(jumps_per_station, subway_df[['station_complex', 'latitude', 'longitude']], on='station_complex', how='left')
result_df = result_df.drop_duplicates(subset=['station_complex'], ignore_index=True)
# Select relevant columns for the enriched arrests data
enriched_arrests = turnstile_jumps[['ARREST_KEY', 'ARREST_DATE', 'PD_DESC', 'ARREST_BORO', 'Latitude', 'Longitude', 'nearest_station']]
# Export enriched arrests data
enriched_arrests.to_csv('enriched_turnstile_jumping_arrests.csv', index=False)
print("Exported 'enriched_turnstile_jumping_arrests.csv'")
Exported 'enriched_turnstile_jumping_arrests.csv'
# Dataset only goes from October to December 2023
enriched_arrests
enriched_arrests['ARREST_DATE'].agg(['min', 'max'])
min 10/09/2023 max 12/31/2023 Name: ARREST_DATE, dtype: object
result_df
| station_complex | turnstile_jumpings | latitude | longitude | |
|---|---|---|---|---|
| 0 | 3 Av-149 St (2,5) | 12 | 40.816110 | -73.917755 |
| 1 | Times Sq-42 St (N,Q,R,W,S,1,2,3,7)/42 St (A,C,E) | 8 | 40.757310 | -73.987495 |
| 2 | Tremont Av (B,D) | 7 | 40.850410 | -73.905230 |
| 3 | Rockaway Av (C) | 7 | 40.678340 | -73.911950 |
| 4 | Ralph Av (C) | 5 | 40.678820 | -73.920784 |
| ... | ... | ... | ... | ... |
| 66 | Newkirk Av-Little Haiti (2,5) | 1 | 40.639970 | -73.948410 |
| 67 | 96 St (1,2,3) | 1 | 40.793920 | -73.972320 |
| 68 | Morrison Av-Soundview (6) | 1 | 40.829520 | -73.874520 |
| 69 | 125 St (4,5,6) | 1 | 40.804140 | -73.937590 |
| 70 | Middletown Rd (6) | 1 | 40.843864 | -73.836320 |
71 rows × 4 columns
Map with Number of Turnstile Jumps at Each Station using Folium¶
# Create a map centered on the mean coordinates of the arrests
center_lat = result_df['latitude'].mean()
center_lon = result_df['longitude'].mean()
nyc_map = folium.Map(location=[center_lat, center_lon], zoom_start=11)
# Create a MarkerCluster
marker_cluster = MarkerCluster().add_to(nyc_map)
# Normalize marker sizes
max_hoppers = result_df['turnstile_jumpings'].max()
min_hoppers = result_df['turnstile_jumpings'].min()
min_marker_size = 5
max_marker_size = 20
# Add markers for each unique station
for idx, row in result_df.iterrows():
# Calculate marker size
if max_hoppers == min_hoppers:
marker_size = (min_marker_size + max_marker_size) / 2
else:
marker_size = min_marker_size + ((row['turnstile_jumpings'] - min_hoppers) / (max_hoppers - min_hoppers)) * (max_marker_size - min_marker_size)
# Create popup text
popup_text = f"Station: {row['station_complex']}<br>Turnstile Jumpings: {row['turnstile_jumpings']}"
# Add marker to the cluster
folium.CircleMarker(
location=[row['latitude'], row['longitude']],
radius=marker_size,
popup=popup_text,
tooltip=row['station_complex'],
color='red',
fill=True,
fillColor='red',
fillOpacity=0.7
).add_to(marker_cluster)
# Save the map
nyc_map.save("turnstile_hoppers_map_oct_to_dec_2023.html")
display(nyc_map)